TIME-SCALE MODIFICATION OF SIGNALS APPLYING TECHNIQUES SPECIFIC TO DETERMINED SIGNAL TYPES



Field of the Invention 

The invention relates to the time-scale modification (TSM) of a signal, in 
particular a speech signal, and more particularly to a system and method that employs 
different techniques for the time-scale modification of voiced and un-voiced speech. 


Background to the Invention 

Time-scale modification (TSM) of a signal refers to compression or expansion 
of the time scale of that signal. For speech signals, TSM expands or compresses the time
scale of the speech, while preserving the identity of the speaker (pitch, formant
structure). As such, it is typically explored for purposes where alteration of the
pronunciation speed is desired. Such applications of TSM include text-to-speech synthesis,
foreign language learning and film/soundtrack post-synchronisation.

Many techniques for fulfilling the need for high quality TSM of speech signals
are known, and examples of such techniques are described in E. Moulines, J. Laroche,
"Non-parametric techniques for pitch-scale and time-scale modification of speech",
Speech Communication (Netherlands), Vol. 16, No. 2, pp. 175-205, 1995.

Another potential application of TSM techniques is speech coding which,
however, is much less reported. Within this application, the basic intention is to compress the
time scale of a speech signal prior to coding, reducing the number of speech samples that
need to be encoded, and to expand it by a reciprocal factor after decoding, to reinstate the
original time scale. This concept is illustrated in Figure 1. Since the time-scale compressed
speech remains a valid speech signal, it can be processed by an arbitrary speech coder. For
example, speech coding at 6 kbit/s could now be realised with an 8 kbit/s coder, preceded by
25% time-scale compression and succeeded by 33% time-scale expansion.
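
The arithmetic behind this example can be checked with a short sketch. The function names
below are hypothetical and serve only as an illustration: compressing the time scale by a
fraction c leaves (1 - c) of the samples to be coded, and the reciprocal expansion that
restores the original duration is 1/(1 - c) - 1.

```python
def effective_bit_rate(coder_rate_kbps, compression):
    """Bit rate per second of ORIGINAL speech when the time scale is
    compressed by the fraction `compression` before coding."""
    return coder_rate_kbps * (1.0 - compression)


def reciprocal_expansion(compression):
    """Fractional expansion needed to restore the original duration."""
    return 1.0 / (1.0 - compression) - 1.0


rate = effective_bit_rate(8.0, 0.25)       # 8 kbit/s coder, 25% compression
expansion = reciprocal_expansion(0.25)     # about one third, i.e. roughly 33%
```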

The use of TSM in this context has been explored in the past, and fairly good
results were claimed using several TSM methods and speech coders [1]-[3]. Recently,
improvements have been made both to TSM and speech coding techniques, where these two
have mostly been studied independently from each other.
As detailed in Moulines and Laroche, as referenced above, one widely used
TSM algorithm is synchronised overlap-add (SOLA), which is an example of a waveform
approach algorithm. Since its introduction [4], SOLA has evolved into a widely used
algorithm for TSM of speech. Being a correlation method, it is also applicable to speech
produced by multiple speakers or corrupted by background noise, and to some extent to
music.

With SOLA, an input speech signal s is analysed as a sequence of N-samples
long overlapping frames x_i (i = 0, ..., m), consecutively delayed by a fixed analysis period of
Sa samples (Sa < N). The starting idea is that s can be compressed or expanded by outputting
these frames while now successively shifting them by a synthesis period Ss, which is chosen
such that Ss < Sa, respectively Ss > Sa (Ss < N). The overlapping segments would be first
weighted by two amplitude complementary functions then added up, which is a suitable way
of waveform averaging. Figure 2 illustrates such an overlap-add expansion technique. The
upper part shows the location of the consecutive frames in the input signal. The middle part
demonstrates how these frames would be re-positioned during the synthesis, employing in
this case two halves of a Hanning window for the weighting. Finally, the resulting time-scale
expanded signal is shown in the lower part.
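
Purely as an illustration, the plain (unsynchronised) overlap-add step described above may
be sketched as follows. The function name and the choice of a raised-cosine cross-fade built
from the two halves of a Hanning window are assumptions made for this sketch; it requires
Ss < N so that consecutive frames overlap.

```python
import math


def ola_time_scale(s, N, Sa, Ss):
    """Plain overlap-add: N-samples long frames taken every Sa samples are
    re-positioned every Ss samples and cross-faded where they overlap."""
    out = list(s[:N])                          # frame 0 is copied unchanged
    i = 1
    while i * Sa + N <= len(s):
        frame = s[i * Sa:i * Sa + N]
        start = i * Ss                         # synthesis position of frame i
        overlap = len(out) - start             # equals N - Ss for every frame
        for j in range(overlap):
            # rising half of a Hanning window; (1 - w) is the falling half,
            # so the two weights are amplitude complementary
            w = 0.5 - 0.5 * math.cos(math.pi * (j + 0.5) / overlap)
            out[start + j] = (1.0 - w) * out[start + j] + w * frame[j]
        out.extend(frame[overlap:])            # append the Ss new samples
        i += 1
    return out
```

With Ss > Sa the output is longer than the input (expansion); with Ss < Sa it is shorter
(compression). Because no synchronisation is performed, periodic waveforms are generally
averaged out of phase, which is the shortcoming SOLA addresses.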

The actual synchronisation mechanism of SOLA consists of additionally
shifting each x_i during the synthesis, to yield similarity of the overlapping waveforms.
Explicitly, a frame x_i will now start contributing to the output signal at position iSs + k_i,
where k_i is found such that the normalised cross-correlation given by Equation 1 is maximal
for k = k_i.

$$R_i[k] = \frac{\sum_{j=0}^{L-1} \tilde{s}[iS_s + k + j]\, s[iS_a + j]}{\sqrt{\sum_{j=0}^{L-1} s^2[iS_a + j] \cdot \sum_{j=0}^{L-1} \tilde{s}^2[iS_s + k + j]}} \qquad (0 \le k \le N/2) \qquad \text{(Equation 1)}$$

In this equation, s̃ denotes the output signal while L denotes the length of the
overlap corresponding to a particular lag k in the given range [1]. Having found k_i, the
synchronisation parameters, the overlapping signals are averaged as before. With a large
number of frames the ratio of the output and input signal length will approach the value Ss/Sa,
hence defining the scale factor α.
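
The synchronisation step may be sketched as below. This is a minimal illustration under
stated assumptions rather than a definitive implementation: the function names are
hypothetical, a linear cross-fade stands in for the amplitude complementary weighting, and
the lag search simply evaluates the normalised cross-correlation for every k in [0, N/2].

```python
import math


def best_lag(out, frame, start, k_max):
    """Lag k in [0, k_max] maximising the normalised cross-correlation
    between the existing output (from start + k) and the new frame."""
    best_k, best_r = 0, -2.0
    for k in range(k_max + 1):
        L = min(len(out) - (start + k), len(frame))   # overlap length for lag k
        if L <= 0:
            break
        a = out[start + k:start + k + L]
        b = frame[:L]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        r = num / den if den > 0.0 else 0.0
        if r > best_r:
            best_k, best_r = k, r
    return best_k


def sola(s, N, Sa, Ss):
    """Return the time-scaled signal and the list of lags k_i."""
    out = list(s[:N])
    ks = [0]                                   # k_0 = 0 for the first frame
    i = 1
    while i * Sa + N <= len(s):
        frame = s[i * Sa:i * Sa + N]
        k = best_lag(out, frame, i * Ss, N // 2)
        pos = i * Ss + k                       # frame i enters at iSs + k_i
        overlap = min(len(out) - pos, N)
        for j in range(overlap):
            w = (j + 1) / (overlap + 1)        # linear cross-fade (assumption)
            out[pos + j] = (1.0 - w) * out[pos + j] + w * frame[j]
        out.extend(frame[overlap:])
        ks.append(k)
        i += 1
    return out, ks
```

For a strictly periodic input, the lag search aligns every frame to a whole number of
periods, so the compressed output remains the same periodic waveform.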
When SOLA compression is cascaded with the reciprocal SOLA expansion,
several artefacts are typically introduced into the output speech, such as reverberation,
artificial tonality and occasional degradation of transients.

The reverberation is associated with voiced speech, and can be attributed to
waveform averaging. Both compression and the succeeding expansion average similar
segments. However, similarity is measured locally, implying that the expansion does not
necessarily insert additional waveform in the region where it was "missing". This results in
waveform smoothing, possibly even introducing new local periodicity. Furthermore, frame
positioning during expansion is designed to re-use the same segments, in order to create
additional waveform. This introduces correlation in unvoiced speech, which is often
perceived as an artificial "tonality".

Artefacts also occur in speech transients, i.e. regions of voicing transition,
which usually exhibit an abrupt alteration of the signal energy level. As the scale factor
increases, so does the distance between iSa and iSs, which may impede alignment of
similar parts of a transient for averaging. Hence, overlapping distinct parts of a transient
causes its "smearing", endangering proper perception of its strength and timing.

In [5], [6], it was reported that a companded speech signal of a good quality
can be achieved by employing the k_i values that are obtained during SOLA compression. So,
quite opposite to what is done by SOLA, the N-samples long frames x̂_i would now be excised
from the compressed signal s̃ at time instants iSs + k_i and re-positioned at the original time
instants iSa (while averaging the overlapping samples as before). The maximal cost of
transmitting/storing all k_i values is given by Equation 2, where Ts is the speech sampling
period and ⌈ ⌉ represents the operation of rounding towards the nearest higher integer.

$$\frac{1}{S_a \cdot T_s}\left[\frac{\text{frame}}{\text{sec}}\right] \cdot \left\lceil \log_2\!\left(\frac{N}{2}\right) \right\rceil \left[\frac{\text{bit}}{\text{frame}}\right] \qquad \text{(Equation 2)}$$

It has also been reported that exclusion of transients from high (i.e. > 30%)
SOLA compression or expansion yields improved speech quality [7].

It will be appreciated therefore that presently several techniques and
approaches exist that can successfully (e.g. giving good quality) be employed for
compressing or expanding the time-scale of signals. Although described specifically with
reference to speech signals, it will be appreciated that this description is of an exemplary
embodiment of a signal type and the problems associated with speech signals are also
applicable to other signal types. When used for coding purposes, where the time-scale
compression is followed by time-scale expansion (time-scale companding), the performance
of prior art techniques degrades considerably. The best performance for speech signals is
generally obtained from time-domain methods, among which SOLA is widely used, but
problems still exist using these methods, some of which have been identified above. There is,
therefore, a need to provide an improved method and system for time scale modifying a
signal in a manner specific to the components making up that signal.

Summary of the Invention

Accordingly the present invention provides a method for time scale modifying 
a signal as detailed in claim 1. 

By providing a method that analyses individual frame segments within a signal
and applies different algorithms to specific signal types, it is possible to optimise the
modification of the signal. Such application of specific modification algorithms to specific
signal types enables a modification of the signal in a manner which is adapted to cater for
different requirements of the individual component segments that make up the signal.

In a preferred embodiment of the present invention, the method is applied to
speech signals and the signal is analysed for voiced and un-voiced components, with different
expansion or compression techniques being utilised for the different types of signal. The
choice of technique is optimised for the specific type of signal.

The present invention additionally provides an expansion method according to
claim 9. The expansion of the signal is effected by the splitting of the signal into portions and
the insertion of noise between the portions. Desirably, the noise is synthetically generated
noise rather than generated from the existing samples, which allows for the insertion of a
noise sequence having similar spectral and energy properties to that of the signal
components.

The invention also provides a method of receiving an audio signal, the method
utilising the time scale modification method of claim 1.

The invention also provides a device adapted to effect the method of claim 1.

These and other features of the present invention will be better understood 
with reference to the following drawings. 
Brief Description of the Drawings

Figure 1 is a schematic showing the known use of TSM in coding applications,

Figure 2 shows time scale expansion by overlap according to a prior art implementation,

Figure 3 is a schematic showing time scale expansion of unvoiced speech by adding
appropriately modelled synthetic noise according to a first embodiment of the present
invention,

Figure 4 is a schematic of a TSM-based speech coding system according to an
embodiment of the present invention,

Figure 5 is a graph showing the segmentation and windowing of unvoiced speech for LPC
computation,

Figure 6 shows a parametric time-scale expansion of unvoiced speech by factor b > 1,

Figure 7 is an example of time scale companded unvoiced speech, where the
noise insertion method of the present invention has been used for the purpose of time scale
expansion, and TDHS for the purpose of time scale compression,

Figure 8 is a schematic of a speech coding system incorporating TSM
according to the present invention,

Figure 9 is a graph showing how the buffer holding the input speech is
updated by left-shifting of the Sa samples long frames,

Figure 10 shows the flow of the input (right) and output (left) speech in the compressor,

Figure 11 shows a speech signal and the corresponding voicing contour (voiced = 1),

Figure 12 is an illustration of different buffers during the initial stage of
expansion, which follows directly the compression illustrated in Figure 10,

Figure 13 shows the example where a present unvoiced frame is expanded
using the parametric method only if both past and future frames are unvoiced as well, and

Figure 14 shows how, during voiced expansion, the present Ss samples long
frame is expanded by outputting the front Sa samples from the 2Sa samples long buffer Y.



Detailed Description of the Drawings 

A first aspect of the present invention provides a method for time-scale
modification of signals which is particularly suited to audio signals, in particular to the
expansion of unvoiced speech, and which is designed to overcome the problem of artificial
tonality introduced by the "repetition" mechanism that is inherently present in all time-domain
methods. The invention provides for the lengthening of the time-scale by inserting an
appropriate amount of synthetic noise that reflects the spectral and energy properties of the
input sequence. The estimation of these properties is based on LPC (Linear Predictive
Coding) and variance matching. In a preferred embodiment the model parameters are derived
from the input signal, which may be an already compressed signal, thereby avoiding the
necessity for their transmission. Although it is not intended to limit the invention to any one
theoretical analysis, it is thought that only a limited distortion of the above mentioned
properties of an unvoiced sequence is caused by a compression of its time-scale.

Figure 4 shows a schematic overview of the system of the present invention. The upper part
shows the processing stages at the encoder side. A speech classifier, represented by the block
"V/UV", is included to determine unvoiced and voiced speech (frames). All speech is
compressed using SOLA, except for the voiced onsets, which are translated. By the term
translated, as used within the present specification, it is meant that these frame components
are excluded from TSM. Synchronisation parameters and voicing decisions are transmitted
through a side channel. As shown in the lower part, they are utilised to identify the decoded
speech (frames) and choose the appropriate expansion method. It will be appreciated,
therefore, that the present invention provides for the application of different algorithms to
different signal types; for example, in one preferred application voiced speech is expanded by
SOLA, while unvoiced speech is expanded using the parametric method.

Parametric Modelling Of Unvoiced Speech 

Linear predictive coding is a widely applied method for speech processing,
employing the principle of predicting the current sample from a linear combination of
previous samples. It is described by Equation 3.1, or, equivalently, by its z-transformed
counterpart 3.2. In Equation 3.1, s and ŝ respectively denote an original signal and its LPC
estimate, and e the prediction error. Further, M determines the order of prediction, and a_i are
the LPC coefficients. These coefficients are derived by any of the well-known algorithms
([6], 5.3), which are usually based on least squares error (LSE) minimisation, i.e.
minimisation of Σ_n e²[n].

$$s[n] = \hat{s}[n] + e[n] = \sum_{i=1}^{M} a_i\, s[n-i] + e[n] \qquad \text{(equation 3.1)}$$

$$H(z) = \frac{S(z)}{E(z)} = \frac{1}{1 - \sum_{i=1}^{M} a_i z^{-i}} = \frac{1}{A(z)} \qquad \text{(equation 3.2)}$$

Using the LPC coefficients, a sequence s can be approximated by the synthesis
procedure described by Equation 3.2. Explicitly, the filter H(z) (often denoted as 1/A(z)) is
excited by a proper signal e, which, ideally, reflects the nature of the prediction error. In the
case of unvoiced speech, a suitable excitation is normally distributed zero-mean noise.

Finally, to ensure a proper amplitude level variation of the synthetic
sequence, the excitation noise is multiplied by a suitable gain G. Such a gain is conveniently
computed based on variance matching with the original sequence s, as described by
Equations 3.3. Usually, the mean value s̄ of an unvoiced sound s can be assumed to be equal
to 0. However, this need not be the case for an arbitrary segment of it, especially if s has
first been submitted to some time-domain weighted averaging (for the purpose of time-scale
modification).






(equation 3.3) 
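
As a hedged illustration of the modelling described above, the following sketch estimates
LPC coefficients with the autocorrelation method (Levinson-Durbin recursion), shapes
unit-variance Gaussian noise with the all-pole filter 1/A(z), and then matches the variance
and mean of the result to a target segment. All function names are hypothetical, windowing
is omitted for brevity, and the particular variance-matching form used here (gain equal to
the ratio of standard deviations) is an assumption made for the sketch.

```python
import math
import random


def lpc(frame, order):
    """LPC coefficients a_1..a_M such that s[n] ~ sum_i a_i * s[n - i],
    obtained with the autocorrelation method (Levinson-Durbin)."""
    r = [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)                    # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:                         # silent frame: keep zeros
            break
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def synthesize_noise(a, length, rng):
    """Excite the all-pole filter 1/A(z) with zero-mean, unit-variance noise."""
    y = []
    for n in range(length):
        acc = rng.gauss(0.0, 1.0)
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y


def mean_var(x):
    m = sum(x) / len(x)
    return m, sum((v - m) ** 2 for v in x) / len(x)


def match_gain_and_mean(y, target):
    """Scale and offset y so its variance and mean equal those of target."""
    my, vy = mean_var(y)
    mt, vt = mean_var(target)
    g = math.sqrt(vt / vy) if vy > 0.0 else 0.0   # assumed variance matching
    return [g * (v - my) + mt for v in y]
```

A typical use would compute `lpc` on a windowed analysis frame, call `synthesize_noise` for
the required number of samples, and pass the result through `match_gain_and_mean` together
with the segment whose level is to be reproduced.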



The described way of signal estimation is only accurate for stationary signals.
Therefore, it should only be applied to speech frames, which are quasi-stationary. When LPC
computation is concerned, speech segmentation also includes windowing, which has the
purpose of minimising smearing in the frequency domain. This is illustrated in Figure 5,
featuring a Hamming window, where N denotes the frame length (typically 15-20 ms) and T
the analysis period.

Finally, it should be noted that the gain and LPC computation need not
necessarily be performed at the same rate, as the time and frequency resolution that is needed
for an accurate estimation of the model parameters does not have to be the same. Typically,
the LPC parameters are updated every 10 ms, whereas the gain is updated much faster (e.g.
every 2.5 ms). Time resolution (described by the gains) for unvoiced speech is perceptually
more important than frequency resolution, since unvoiced speech typically has more
high-frequency content than voiced speech.
A possible way to realise time-scale modification of unvoiced speech utilising
the previously discussed parametric modelling is to perform the synthesis at a different rate
than the analysis. Figure 6 illustrates a time-scale expansion technique that exploits this
idea. The model parameters are derived at a rate 1/T (1), and used for the synthesis (3)
at rate 1/bT. The Hamming windows deployed during the synthesis are only used to illustrate
the rate change. In practice, power complementary weighting would be most appropriate.
During the analysis stage, the LPC coefficients and the gain are derived from the input signal,
here at the same rate. Specifically, after each period of T samples, a vector of LPC coefficients
a and a gain G are computed over the length of N samples, i.e. for an N-samples long frame.
In a way, this can be viewed as defining a 'temporal vector space' V, according to Equation
3.4, which is for simplicity shown as a two-dimensional signal.

$$V = V(\mathbf{a}(t), G(t)) \qquad (\mathbf{a} = [a_1, \ldots, a_M],\; t = nT,\; n = 1, 2, \ldots) \qquad \text{(equation 3.4)}$$

To obtain time-scale expansion by a scale factor b (b > 1), this vector space is
simply 'down-sampled' by the same factor, prior to the synthesis. Explicitly, after each period
of bT samples, an element of V is used for the synthesis of a new N-samples long frame.
Hence, compared to the analysis frames, the synthesis frames will be overlapping in time by
a smaller amount. To demonstrate this, the frames have been marked by using the Hamming
windows again. In practice, it will be appreciated that the overlapping parts of the synthesis
frames may be averaged by applying power-complementary weighting instead, deploying
the appropriate windows for that purpose. It will be appreciated that, by performing the
synthesis at a faster rate than the analysis, time-scale compression could be achieved in a
similar way.
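
The rate change can be made concrete with a small sketch. The helper names are hypothetical
and the rounding of bT to a whole number of samples is an assumption of the sketch; it merely
shows where the synthesis frames start and how much they overlap.

```python
def synthesis_positions(num_models, T, b):
    """Start sample of each synthesis frame when the n-th model, derived at
    sample n*T during analysis, is reused at sample n*b*T during synthesis."""
    return [int(round(n * b * T)) for n in range(num_models)]


def synthesis_overlap(N, T, b):
    """Overlap in samples between consecutive N-samples long synthesis frames."""
    return max(0, N - int(round(b * T)))


positions = synthesis_positions(4, 80, 1.5)
```

For N = 160 and T = 80, the analysis frames (b = 1) overlap by 80 samples, whereas synthesis
frames produced with b = 1.5 overlap by only 40 samples.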

It will be appreciated by those skilled in the art that the output signal produced
by applying this approach is an entirely synthetic signal. A faster update of the gain could
serve as a possible remedy to reduce the artefacts, which are usually perceived as an
increased noisiness. A more effective approach, however, is to reduce the amount of synthetic
noise in the output signal. In the case of time-scale expansion, this can be accomplished as
detailed below.

Instead of synthesising whole frames at a certain rate, in one embodiment of
the present invention a method is provided for the addition of an appropriate and smaller
amount of noise to be used to lengthen the input frames. The additional noise for each frame
is obtained in a similar way as before, namely from the models (LPC coefficients and the gain)
derived for that frame. When expanding compressed sequences, in particular, the window
length for LPC computation may generally extend beyond the frame length. This is
principally meant to give the region of interest a sufficient weight. Subsequently, a
compressed sequence which is being analysed is assumed to have sufficiently retained the
spectral and energy properties of the original sequence from which it has been obtained.

Using the illustration from Figure 3, firstly, an input unvoiced sequence s[n] is
submitted to segmentation into frames. Each of the L-samples long input frames A_iA_{i+1} will
be expanded to a desired length of L_e samples (L_e = α · L, where α > 1 is the scale factor).
In accordance with the earlier explanation, the LPC analysis will be performed on the
corresponding, longer frames B_iB_{i+1}, which, for that purpose, are windowed.

The time-scale expanded version of one particular frame A_iA_{i+1} (denoted by
s_i) is then obtained as follows. An L_e samples long, zero-mean and normally distributed
(σ_e = 1) noise sequence is shaped by the filter 1/A(z), defined by the LPC coefficients derived
from B_iB_{i+1}. Such a shaped noise sequence is then given gain and mean values which are
equal to those of frame A_iA_{i+1}. Computation of these parameters is represented by block
"G". Next, frame A_iA_{i+1} is split into two halves, namely A_iC_i and C_iA_{i+1}, and the
additional noise is inserted in between them. This added noise is excised from the middle of
the previously synthesised noise sequence of length L_e. Practically, it will be appreciated
that these actions can be achieved by proper windowing and zero-padding, giving each
sequence the same length of L_e samples, then simply adding them all together.
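
The splitting and insertion step may be sketched as follows. This is an illustration only:
the function name is hypothetical, the shaped noise sequence is assumed to have already been
given the gain and mean of the frame, and hard joins are used in place of any cross-fade
around the insertion region.

```python
def insert_noise(frame, shaped_noise):
    """Expand an L-samples long frame to L_e = len(shaped_noise) samples by
    splitting it into two halves and inserting, between them, the L_e - L
    samples excised from the middle of the shaped noise sequence."""
    L, Le = len(frame), len(shaped_noise)
    extra = Le - L                      # number of noise samples to insert
    half = L // 2                       # split point C_i of the frame
    start = (Le - extra) // 2           # centres the excised noise segment
    middle = shaped_noise[start:start + extra]
    return frame[:half] + middle + frame[half:]


expanded = insert_noise(list(range(8)), [100 + i for i in range(12)])
```

In this toy example an 8-samples long frame is lengthened to 12 samples: the original samples
are all retained and only the four inserted samples are synthetic.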

In addition, the windows drawn by dashed lines suggest that averaging
(cross-fade) can be performed around the joints of the region where the noise is being
inserted. Still, due to the noise-like character of all involved signals, possible (perceptual)
benefits of such 'smoothing' in the transition regions remain bounded.

In Figure 7, the approach explained above is demonstrated by an example.
First, TDHS compression has been applied to an original unvoiced sequence s[n], producing
Sc[n] as a result. The original time-scale has then been re-instated by applying expansion to
Sc[n]. The noise insertion is made apparent by zooming in on two particular frames.

It will be understood that the above described way of noise insertion is in
accordance with the usual way of performing LPC analysis, employing the Hamming
window, and since the central part of the frame is given the highest weight, inserting the
noise in the middle seems logical. However, if the input frame marks a region close to an
acoustical event, like a voicing transition, then inserting the noise in a different way may be
more desirable. For example, if the frame consists of unvoiced speech gradually transforming
into a more 'voiced-like' speech, then insertion of synthetic noise closer to the beginning of
the frame (where the most noise-like speech is located) would be most appropriate. An
asymmetrical window putting the most weight on the left part of the frame could then be
suitably used for the purpose of LPC analysis. It will be appreciated therefore that the
insertion of noise in different regions of the frame may be considered for different types of
signal.

Figure 8 shows a TSM-based coding system incorporating all the previously
explained concepts. The system comprises a (tuneable) compressor and a corresponding
expander, allowing an arbitrary speech codec to be placed in between them. The time-scale
companding is desirably realised by combining SOLA, parametric expansion of unvoiced
speech and the additional concept of translating voiced onsets. It will also be appreciated that
the speech coding system of the present invention can also be used independently for the
parametric expansion of unvoiced speech. In the following sections, details concerning the
system set-up and realisation of its TSM stages are given, including a comparison with some
standard speech coders.

The signal flow can be described as follows. The incoming speech is
submitted to buffering and segmentation into frames, to suit the succeeding processing
stages. Namely, by performing a voicing analysis on the buffered speech (inside the block
denoted by 'V/UV') and shifting the consecutive frames inside the buffer, a flow of the
voicing information is created, which is exploited to classify speech parts and handle them
accordingly. Specifically, voiced onsets are translated, while all other speech is compressed
using SOLA. The outgoing frames are then passed to the codec (A), or bypass the codec
(B) directly to the expander. Simultaneously, the synchronisation parameters are transmitted
through a side channel. They are used to select and perform a certain expansion method. That
is, voiced speech is expanded using the SOLA frame shifts k_i. During SOLA, the N-samples
long analysis frames x_i are excised from an input signal at times iSa, and output at the
corresponding times k_i + iSs. Eventually, such a modified time-scale can be restored by the
opposite process, i.e. by excising N-samples long frames x̂_i from the time-scale modified
signal at times k_i + iSs, and outputting them at times iSa. This procedure can be expressed
through Equation 4.0, where s̃ and ŝ respectively denote the TSM-ed and reconstructed
version of an original signal s. It is assumed here that k_0 = 0, in accordance with the indexing
of k_i starting from i = 1. ŝ[n] may be assigned multiple values, i.e. samples from different
frames which will overlap in time, and should be averaged by cross-fade.

$$\hat{x}_i[n] = \hat{s}[n + iS_a] = \tilde{s}[n + iS_s + k_i] \qquad (i = 0, \ldots, m),\; (n = 0, \ldots, N-1) \qquad \text{(Equation 4.0)}$$


By comparing the consecutive overlap-add stages of SOLA and the
reconstruction procedure outlined above, it can easily be seen that x̂_i and x_i will generally
not be identical. It will therefore be appreciated that these two processes do not exactly form a
"1-1" transformation pair. However, the quality of such reconstruction is notably higher
compared to merely applying SOLA that uses a reciprocal Ss/Sa ratio.
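
The reconstruction procedure can be sketched as below, as an illustration under stated
assumptions: the function name is hypothetical, the lags k_i are taken to be available from
the compression stage, and a linear cross-fade is assumed for averaging the overlapping
samples.

```python
def expand_with_lags(compressed, ks, N, Sa, Ss):
    """Excise frame i from the compressed signal at i*Ss + k_i and
    re-position it at i*Sa, cross-fading wherever frames overlap."""
    out = []
    for i, k in enumerate(ks):
        src = i * Ss + k
        frame = compressed[src:src + N]
        if len(frame) < N:                     # incomplete frame: stop here
            break
        pos = i * Sa                           # original time instant of frame i
        overlap = max(0, min(len(out) - pos, N))
        for j in range(overlap):
            w = (j + 1) / (overlap + 1)        # linear cross-fade (assumption)
            out[pos + j] = (1.0 - w) * out[pos + j] + w * frame[j]
        out.extend(frame[overlap:])
    return out
```

Because the same lags are reused, each frame returns to the position it occupied in the
original signal; only the samples that were averaged during compression differ from the
original.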

The unvoiced speech is desirably expanded using the parametric method
previously described. It should be noted that the translated speech segments are used to
realise the expansion, instead of simply being copied to the output. Through suitable
buffering and manipulation of all received data, a synchronised processing results, where
each incoming frame of the original speech will produce a frame at the output (after an initial
delay).

It will be appreciated that a voiced onset may be simply detected as any
transition from unvoiced-like to voiced-like speech.

Finally, it should be noted that the voicing analysis could in principle be
performed on the compressed speech as well, and that process could therefore be used to
eliminate the need for transmitting the voicing information. However, such speech would be
rather inadequate for that purpose, because relatively long analysis frames must usually be
analysed in order to obtain reliable voicing decisions.

Figure 9 shows the management of an input speech buffer, according to the
present invention. The speech contained in the buffer at a certain time is represented by

segment 0^^ . The segment OA/, underlying the Hamming window, is submitted to voicing 
analysis, providing a voicing decision which is associated with the samples in the centre. The
window is only used for illustration, and does not suggest the necessity for weighting of the
speech; an example of the techniques which may be used for any weighting may be found in
R.J. McAulay and T.F. Quatieri, "Pitch estimation and voicing detection based on a
sinusoidal speech model", IEEE Int. Conf. on Acoustics, Speech and Signal Processing,

1990. The acquired voicing decision is attributed to Sa samples long segment A-^ , where 
V <S^ and - V\ « . Further, the speech is segmented in Sa samples long frames 
A_iA_{i+1} (i = 0, ..., 3), enabling a convenient realisation of SOLA and buffer management.
Specifically, A_0A_2 and A_1A_3 will play the role of two consecutive SOLA analysis frames x_i
and x_{i+1}, while the buffer will be updated by left-shifting of frames A_iA_{i+1} (i = 0, 1, 2)
and putting new samples at the 'emptied' position of A_3A_4.
The compression can easily be described using Figure 10, where four initial
iterations are illustrated. The flow of the input and output speech can be respectively
followed on the right and left side of the figure, where some familiar features of SOLA are
apparent. Among the input frames, voiced ones are marked by "1" and unvoiced by "0".

Initially, the buffer contains a zero signal. Then, a first frame (A_3A_4) is read,
in this case announcing a voiced segment. Note that the voicing of this frame will be known
only after it has arrived at the position of A^ A^ , in accordance with the earlier described way
of performing the voicing analysis. Thus, the algorithmic delay amounts to 3Sa samples. On
the left side, the continuously changing grey-painted frame, hence the synthesis frame,
represents the front samples of the buffer holding the output (synthesis) speech at a particular
time. (As will become clear, the minimal length of this buffer is (k_i)max + 2Sa = 3Sa samples.)
In accordance with SOLA, this frame is updated by overlap-add with the consecutive analysis
frames, at the rate determined by Ss (Ss < Sa). So, after the first two iterations, the Ss samples long

frames A^a^ and aj^j will consecutively have been output, as they become obsolete for new 

updates, respectively by the analysis frames A^A^ and AjA^ . This SOLA compression will 
continue as long as the present voicing decision has not changed from 0 to 1, which here
happens in step 3. At that point, the whole synthesis frame will be output, except for its last
Sa samples, to which the last Sa samples from the current analysis frame are appended. This can

be viewed as re-initialisation of the synthesis frame, now becoming o^A^ . With it, a new 
SOLA compression cycle starts in step 4, etc. 
It can be seen that, while maintaining speech continuity, much of frame a^A^
will be translated, as well as several input frames succeeding it, thanks to SOLA's slow
convergence. These parts exactly correspond to the region which is most likely to contain a
voiced onset.

It can now be concluded that after each iteration the compressor will output an
"information triplet", consisting of a speech frame, a SOLA k and a voicing decision
corresponding to the front frame in the buffer. Since no cross-correlation is computed during
the translation, k_i = 0 will be attributed to each translated frame. So, by denoting speech
frames by their length, the triplets produced in this case are (Ss, k_0, 0), (Ss, k_1, 0),
(Sa + k_1, 0, 0) and (Ss, k_3, 1). Note that the transmission of (most) k values acquired during
the compression of unvoiced speech is superfluous, because (most) unvoiced frames will be
expanded using the parametric method.

The expander is desirably adapted to keep track of the synchronisation
parameters in order to identify the incoming frames and handle them appropriately.

The principal consequence of translation of voiced onsets is that it "disturbs" a
continuous time-scale compression. It will be appreciated that all compressed frames have
the same length of Ss samples, while the length of translated frames is variable. This could
introduce difficulties in maintaining a constant bit-rate when the time-scale compression is
followed by the coding. At this stage, we choose to compromise the requirement of achieving
a constant bit rate, in favour of achieving a better quality.

With respect to the quality, one could also argue that preserving a segment of
the speech through translation could introduce discontinuities if the connecting segments on
both of its sides are distorted. By detecting voiced onsets early, which implies that the
translated segment will start with a part of the unvoiced speech preceding the onset, it is
possible to minimise the effect of such discontinuities. It will also be appreciated that
SOLA's slow convergence for moderate compression rates ensures that the terminating part
of the translated speech will include some of the voiced speech succeeding the onset.

It will be appreciated that during the compression each incoming S_a samples
long frame will produce an S_s or S_a + k_{i-1} (k_{i-1} < S_a) samples long frame at the
output. Hence, in order to reinstate the original time-scale, the speech coming from the
expander should desirably consist of S_a samples long frames, or frames having different
lengths but producing the same total length of m · S_a, with m being the number of
iterations. The present discussion is with regard to a realisation which is capable of only
approximating the desired length and is the result of a pragmatic choice, allowing us to
simplify the operations and avoid introducing further algorithmic delay. It will be
appreciated that alternative methodology may be deemed necessary for differing applications.
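
The length bookkeeping in the preceding paragraph can be made concrete with a short calculation. This is only an illustrative sketch: the function name `compressed_lengths`, the `(translated, k_prev)` encoding of frames and the numeric values are assumptions introduced here.

```python
from typing import List, Tuple


def compressed_lengths(s_a: int, s_s: int,
                       frames: List[Tuple[bool, int]]) -> List[int]:
    """Return the output length of each compressor iteration.

    `frames` holds one (translated, k_prev) pair per incoming S_a-sample
    frame: a compressed frame yields S_s samples, a translated frame
    yields S_a + k_prev samples, where k_prev must be below S_a.
    """
    lengths = []
    for translated, k_prev in frames:
        if translated:
            if not 0 <= k_prev < s_a:
                raise ValueError("k_prev must satisfy 0 <= k_prev < S_a")
            lengths.append(s_a + k_prev)
        else:
            lengths.append(s_s)
    return lengths


# Hypothetical values: S_a = 160, S_s = 120, third frame translated.
S_A, S_S = 160, 120
frames = [(False, 0), (False, 0), (True, 12), (False, 0)]
out = compressed_lengths(S_A, S_S, frames)
target_total = len(frames) * S_A   # m * S_a samples expected after expansion
```

Whatever the individual frame lengths at the compressor output, the expander aims at `target_total` samples in total, and the realisation discussed here only approximates that figure.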
In the following, we shall assume that several separate buffers are at our disposal,
all of which will be updated by simple shifting of samples. For the sake of
illustration, we shall be showing the complete "information triplets" as produced by the
compressor, including the k values acquired during compression of unvoiced sounds, most of
which are actually obsolete.
This is also illustrated in Figure 12, where an initial state is shown. The buffer
for incoming speech is represented by segment a_0a_4, which is 4S_s samples long. For the
sake of illustration, it is assumed the expansion directly follows the compression described in
Figure 10. Two additional buffers ξ and Y will serve, respectively, to provide the input
information for the LPC analysis and to facilitate expansion of voiced parts. Another two
buffers are deployed to hold the synchronisation parameters, namely the voicing decisions
and the k values. The flow of these parameters will be used as a criterion to identify the
incoming speech frames and handle them appropriately. From now on, we shall refer to
positions 0, 1 and 2 as past, present and future, respectively.
During the expansion, some typical actions will be performed on the "present"
frame, invoked by particular states of the buffers containing the synchronisation parameters.
In the following, this is clarified through examples.

i. Unvoiced expansion 

The parametric expansion method previously described is exclusively
deployed in the situation where all three frames of interest are unvoiced, as shown in Figure
13. This implies d(a_0a_1) = S_s, d(a_1a_2) = S_s and d(a_2a_3) = S_a or S_a + k[1]. Later, an
additional requirement will also be introduced and explained, stating that these frames should
not form an immediate continuation of a voiced offset (transition from voiced to unvoiced
speech).

Hence, the present frame a_1a_2 is extended to the length of S_a samples and
output, which is followed by left-shifting the buffer contents by S_s samples, making a_2a_3 the
new present frame and updating the contents of the "LPC buffer" ξ. (Typically,
d(ξ) ≈ 2S_a.)
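
The extension of an unvoiced frame from S_s to S_a samples can be sketched as follows. This is a deliberately simplified illustration and not the parametric method itself: the inserted noise is white and matched only in RMS gain, whereas the described method also shapes the noise spectrally through LPC analysis. The function name `expand_unvoiced`, the mid-frame split point and the sample values are assumptions.

```python
import math
import random
from typing import List, Optional


def expand_unvoiced(frame: List[float], s_a: int,
                    rng: Optional[random.Random] = None) -> List[float]:
    """Extend an unvoiced frame to s_a samples by inserting noise mid-frame.

    Simplification: the noise is white and only matched in RMS gain to the
    frame; LPC-based spectral shaping is omitted in this sketch.
    """
    rng = rng or random.Random(0)
    n_insert = s_a - len(frame)
    if n_insert < 0:
        raise ValueError("frame is already longer than s_a")
    rms = math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0
    noise = [rng.gauss(0.0, rms) for _ in range(n_insert)]
    half = len(frame) // 2          # assumed split into lead-in and lead-out
    return frame[:half] + noise + frame[half:]


# Hypothetical 6-sample frame (S_s = 6) extended to S_a = 8 samples.
compressed = [0.1, -0.2, 0.05, 0.3, -0.1, 0.2]
expanded = expand_unvoiced(compressed, s_a=8)
```

Because no sample of the frame is repeated, the artificial correlation that causes a tonal character is avoided; only new noise is added between the two halves.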


ii. Voiced Expansion 

A possible voicing state invoking this expansion method is illustrated in
Figure 14. Let us first assume that the compressed signal starts with a_1a_2, i.e. that a_0a_1,
v[0] and k[0] are empty. Then, Y and X exactly represent the first two frames of a time-scale
"reconstruction" process. In this "reconstruction" process, 2S_a samples long frames x_i, with
in this case Y = x_0 and X = x_1, need to be excised from the compressed signal at positions
iS_s + k_i and "put back" at the original positions iS_a, while cross-fading the overlapping
samples. The first S_a samples of Y are not used during the overlap-add, so they are output.
This can be viewed as expansion of the S_s samples long frame a_1a_2, which is then replaced by
its successor a_2a_3 by the usual left-shifting. It is now clear that all consecutive S_s samples
long frames can be expanded in an analogous way, i.e. by outputting the first S_a samples from
buffer Y, where the rest of this buffer is continuously updated through overlap-add with X
obtained for the present k, i.e. k[1]. Explicitly, X will contain 2S_a samples from the input
buffer, starting with the (S_s + k[1])-th sample.
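
The "reconstruction" process for voiced speech can be sketched in a few lines. This is a minimal batch-style illustration under stated assumptions rather than the buffer-based realisation described here: the function name `expand_voiced`, the linear cross-fade ramp and the test signal are choices made for this example, and the offsets in `ks` are assumed to be the ones produced during compression.

```python
from typing import List


def expand_voiced(compressed: List[float], ks: List[int],
                  s_a: int, s_s: int) -> List[float]:
    """Re-expand a SOLA-compressed signal using the stored offsets ks.

    Frame i (2*s_a samples) is read at i*s_s + ks[i] and written back at
    i*s_a; the s_a samples overlapping the previous frame are cross-faded
    with a linear ramp (an assumed window shape).
    """
    y = list(compressed[ks[0]:ks[0] + 2 * s_a])   # buffer Y holds frame x_0
    out: List[float] = []
    for i in range(1, len(ks)):
        out.extend(y[:s_a])                       # front half of Y is final
        start = i * s_s + ks[i]
        x = compressed[start:start + 2 * s_a]     # buffer X holds frame x_i
        if len(x) < 2 * s_a:
            raise ValueError("compressed signal too short for frame %d" % i)
        tail = y[s_a:]
        faded = [
            tail[n] * (1 - n / s_a) + x[n] * (n / s_a)   # linear cross-fade
            for n in range(s_a)
        ]
        y = faded + x[s_a:]
    out.extend(y)                                 # flush the last frame
    return out


# Hypothetical constant signal: cross-fading must leave it unchanged.
signal = [1.0] * 14
restored = expand_voiced(signal, ks=[0, 1, 0], s_a=4, s_s=3)
```

Each iteration emits S_a samples while advancing only S_s samples in the compressed input, which is how the original time scale is reinstated.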

iii. Translation 

As detailed previously, the term "translation" as used within the present
specification is intended to refer to all situations where the present frame, or a part of it, is
output as is or skipped, i.e. shifted but not output. Figure 15 shows that at the time the
unvoiced frame a_1a_2 has become the present frame, its front S_a - S_s samples will already have
been output during the previous iteration. Namely, these samples are included in the front S_a
samples of Y, which have been output during the expansion of a_0a_1. Consequently,
expanding a present unvoiced frame that follows a past voiced frame using the parametric
method would disturb speech continuity. Therefore, we first decide to maintain voiced
expansion during such voiced offsets. In other words, the voiced expansion is prolonged to
the first unvoiced frame succeeding a voiced frame. This will not activate the "tonality
problem", which is primarily caused when "repetition" of SOLA expansion extends over a
relatively longer unvoiced segment.

However, it is clear that the above outlined problem will now only be
postponed and will re-appear with the future frame a_2a_3. Keeping in mind the way voiced
expansion is performed, i.e. the way Y is updated, a total of k_i (0 < k_i < S_a) samples may have
already been output (modified by cross-fade) before they have arrived at the front of the
buffer.

In order to obviate this problem, firstly the k_i samples of the present frame that have
been used in the past are skipped. This now implies a deviation from the principle exploited
so far, where for each incoming S_s samples S_a samples are output. In order to compensate for
the "shortage" of samples, we shall use the "surplus" of samples contained in the translated
S_a + k_j samples long frames produced by the compressor. If such a frame does not directly
follow a voiced offset (if a voiced onset does not appear shortly after a voiced offset) then
none of its samples will have been used in the previous iterations, and it can be output as a
whole. Hence, the "shortage" of k_i samples following a voiced offset will be counterbalanced
by a "surplus" of at most k_j samples preceding the next voiced onset.

Since both k_j and k_i are obtained during compression of unvoiced speech,
therefore having a random-like character, their counterbalance will not be exact for a
particular j and i. As a consequence, a slight mismatch between the duration of the original
and the corresponding companded unvoiced sounds will generally result, which is expected
not to be perceivable. At the same time, speech continuity is assured.

It should be noted that the mismatch problem could easily be tackled even
without introducing additional delay and processing, by choosing the same k for all unvoiced
frames during the compression. Possible quality degradation due to this action is expected to
remain bounded, since waveform similarity, based on which k is computed, is not an
essential similarity measure for unvoiced speech.

It should be noted that it is desirable for all the buffers to be consistently
updated, in order to ensure speech continuity when switching between different actions. For
the purpose of this switching and identification of incoming frames, a decision mechanism
has been established, based on inspecting the states of the voicing buffer and the "k-buffer".
It can be summarised through the table given below, where the previously described actions
are abbreviated. To signal "re-usage" of samples, i.e. occurrence of a voiced offset in the
past, an additional predicate named "offset" is introduced. It can be defined by looking one
step further into the past of the voicing buffer, as true if v[0] = 1 ∨ v[-1] = 1 and false in
all other cases (∨ denotes logical "or"). Note that through suitable manipulation, no explicit
memory location for v[-1] is needed.
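
One possible way of maintaining the "offset" predicate without storing v[-1] is sketched below. The manipulation actually used is not specified in the text, so this is only an assumed variant: the predicate is evaluated just before the voicing buffer is shifted, when the values that will become v[-1] and v[0] are still available. The function name `shift_voicing` is hypothetical.

```python
from typing import List, Tuple


def shift_voicing(v: List[int], new_decision: int) -> Tuple[List[int], bool]:
    """Shift the [past, present, future] voicing buffer by one frame.

    Returns the updated buffer and the "offset" predicate for the new state.
    After the shift the old v[0] plays the role of v[-1] and the old v[1]
    becomes v[0], so the predicate is computed from the pre-shift buffer and
    no extra storage slot for v[-1] is required.
    """
    past, present, future = v
    offset = bool(past or present)        # equals new v[-1] or new v[0]
    return [present, future, new_decision], offset


# Voiced frame leaves the "past" slot: offset stays true for one more step.
state, offset = shift_voicing([1, 0, 0], new_decision=0)
```

With this arrangement the predicate remains true for the first frames after a voiced segment and falls back to false once two consecutive unvoiced frames have passed.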
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Table 1 Selecting actions of the expander 
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It will be appreciated that the present invention utilises a time-scale expansion 
method for unvoiced speech. Unvoiced speech is compressed with SOLA, but expanded by
insertion of noise with the spectral shape and the gain of its adjacent segments. This avoids 
the artificial correlation which is introduced by "re-using" unvoiced segments. 
If TSM is combined with speech coders that operate at lower bit rates (i.e. < 8
kbit/s), the TSM-based coding performs worse compared to conventional coding (in this case
AMR). If the speech coder is operating at higher bit rates, a comparable performance can be
achieved. This can have several benefits. The bit rate of a speech coder with a fixed bit rate
can now be lowered to any arbitrary bit rate by using higher compression ratios. With
compression ratios up to 25 %, the performance of the TSM system can be comparable to a
dedicated speech coder. Since the compression ratio can be varied in time, the bit rate of the
TSM system can also be varied in time. For example, in case of network congestion, the bit
rate can be temporarily lowered. The bit stream syntax of this speech coder is not changed by
the TSM. Therefore, standardised speech coders can be used in a bit stream compatible
manner. Furthermore, TSM can be used for error concealment in case of erroneous
transmission or storage. If a frame is received erroneously, the adjacent frames can be
time-scale expanded more in order to fill the gap introduced by the erroneous frame.
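
The relation between compression ratio and effective bit rate can be checked with a short calculation, using the example given in the Background (an 8 kbit/s coder preceded by 25 % time-scale compression). The function names below are illustrative assumptions; the arithmetic simply assumes that the number of coded samples scales with the compressed duration.

```python
def effective_bit_rate(coder_kbps: float, compression: float) -> float:
    """Bit rate relative to the original time scale, in kbit/s."""
    if not 0.0 <= compression < 1.0:
        raise ValueError("compression must lie in [0, 1)")
    return coder_kbps * (1.0 - compression)


def reciprocal_expansion(compression: float) -> float:
    """Fractional expansion needed to undo the given compression."""
    if not 0.0 <= compression < 1.0:
        raise ValueError("compression must lie in [0, 1)")
    return 1.0 / (1.0 - compression) - 1.0


rate = effective_bit_rate(8.0, 0.25)       # 8 kbit/s coder, 25 % compression
expansion = reciprocal_expansion(0.25)     # about one third, i.e. roughly 33 %
```

Varying `compression` over time in such a calculation corresponds to the variable bit rate operation mentioned above, for instance during network congestion.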

It has been shown that most of the problems accompanying time-scale
companding occur during the unvoiced segments and voiced onsets that are present in a
speech signal. In the output signal, the unvoiced sounds take on a tonal character, while less
gradual and smooth voiced onsets are often smeared, especially when larger scale factors are
used. The tonality in unvoiced sounds is introduced by the "repetition" mechanism which is
inherently present in all time-domain algorithms. To overcome this problem, the present
invention provides separate methods for expanding voiced and unvoiced speech. A method is
provided for expansion of unvoiced speech, which is based on inserting an appropriately
shaped noise sequence into the compressed unvoiced sequences. To avoid smearing of voiced
onsets, the voiced onsets are excluded from TSM and are then translated.

The combination of these concepts with SOLA has enabled the realisation of
a time-scale companding system which outperforms the traditional realisations that use a
similar algorithm for both compression and expansion.

It will be appreciated that the introduction of a speech codec between the
TSM stages may cause quality degradation, being more noticeable in proportion to the
lowering of the bit-rate of the codec. When a particular codec and TSM are combined to
produce a certain bit-rate, the resulting system performs worse than dedicated speech coders
operating at a comparable bit-rate. At lower bit-rates, quality degradation is unacceptable.
However, TSM can be beneficial in providing graceful degradation at higher bit-rates.

Although hereinbefore described with reference to one specific
implementation, it will be appreciated that several modifications are possible. Refinements of
the proposed expansion method for unvoiced speech through deploying alternative ways of
noise insertion and gain computation could be utilised.

Similarly, although the description of the invention is mainly addressed to 
time scale expanding a speech signal, the invention is further applicable to other signals such 
as but not limited to an audio signal. 

It should be noted that the above-mentioned embodiments illustrate rather than
limit the invention, and that those skilled in the art will be able to design many alternative
embodiments without departing from the scope of the appended claims. In the claims, any
reference signs placed between parentheses shall not be construed as limiting the claim. The
word 'comprising' does not exclude the presence of other elements or steps than those listed
in a claim. The invention can be implemented by means of hardware comprising several
distinct elements, and by means of a suitably programmed computer. In a device claim
enumerating several means, several of these means can be embodied by one and the same
item of hardware. The mere fact that certain measures are recited in mutually different
dependent claims does not indicate that a combination of these measures cannot be used to
advantage.

CLAIMS:



1. A method of time scale modifying a signal, the method comprising the steps
of:

a) defining individual frame segments within the signal,

b) analysing the individual frame segments to determine a signal type in each frame
segment, and

c) applying a first algorithm to a determined first signal type and a second different
algorithm to a determined second signal type.

2. The method as claimed in claim 1 wherein the first signal type is a voiced
signal segment and the second signal type is an un-voiced signal segment.

3. The method as claimed in claim 1 or claim 2 wherein the first algorithm is 
based on a waveform technique and the second algorithm is based on a parametric technique. 

4. The method as claimed in any preceding claim wherein the first algorithm is a
SOLA algorithm.

5. The method as claimed in any preceding claim wherein the second algorithm
comprises the steps of:

a) dividing each frame of the determined second signal type into a lead-in and a lead-out
portion,

b) generating a noise signal, and

c) inserting the noise signal between the lead-in and lead-out portions so as to effect an
expanded segment.


6. The method as claimed in any preceding claim wherein the first and second
algorithms are expansion algorithms and the method is used for time scale expanding a
signal.

7. The method as claimed in any one of claims 1 to 5 wherein the first and
second algorithms are compression algorithms and the method is used for time scale
compressing a signal.

8. A method as claimed in claim 1, wherein the signal is a time scale modified
audio signal.

9. A method of time scale expanding a signal comprising the steps of:

a) splitting the signal in a first portion and a second portion, and

b) inserting noise in between the first portion and the second portion to obtain a time scale
expanded signal.

10. A method as claimed in any preceding claim, wherein the signal is an audio 
signal and in particular unvoiced segments are time scale expanded. 


11. A method as claimed in claim 9, wherein the noise is synthetic noise with a 
spectral shape equivalent to the spectral shape of the first and second portions of the signal. 

12. A method of receiving an audio signal, the method comprising the steps of:

a) decoding the audio signal, and

b) time scale expanding the decoded audio signal according to a method as claimed in claim
1.



13. A time scale modifying device adapted to modify a signal so as to effect the
formation of a time scale modified signal comprising:

a) means for determining different signal types within frames of the signal, and

b) means for applying a first modification algorithm to frames having a first determined
signal type and a second different modification algorithm to frames having a second
determined signal type.

14. The device as claimed in claim 13 wherein the means for applying a second
different modification algorithm to the second determined signal type comprises:

a) means for splitting the signal frame in a first portion and a second portion, and

b) means for inserting noise in between the first portion and the second portion to obtain a
time scale expanded signal.

15. A receiver for receiving an audio signal, the receiver comprising:

a) a decoder for decoding the audio signal, and 

b) a device according to claim 13 or claim 14 for time scale expanding the decoded audio 
signal. 
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